Applications in Plant Sciences — Latest Matching Preprints

1

Making the most out of it: shallow genome-skimming possibilities for the systematics of prickly lineages of Solanum (Solanaceae)

Alves, R. T. d. L.; Gouvea, Y. F.; Dalapicolla, J.; Poczai, P.; Giacomin, L. L.

2026-07-09 plant biology 10.64898/2026.07.08.737304 medRxiv

Top 0.1%

37.9%

Premise: Genome skimming (GS) is a cost-effective approach for plant phylogenomics, but its ability to recover informative datasets from different genomic compartments, particularly genome-wide SNPs, remains poorly explored in Solanum. Methods: We evaluated shallow GS for phylogenetic inference in South American prickly Solanum lineages by recovering plastid, mitochondrial, and nuclear datasets, including coding regions and genome-wide SNPs. Phylogenies were inferred using maximum-likelihood and coalescent approaches under different SNP filtering strategies. Results: GS successfully recovered complete plastomes, organellar coding regions, and large SNP datasets, but failed to consistently assemble mitochondrial genomes or recover low-copy nuclear genes. SNP-based analyses, especially from the nuclear genome, produced stable, well-supported phylogenies that were largely congruent across inference methods. In contrast, coding-region datasets, particularly from the mitochondrial genome, showed greater topological discordance, revealing cytonuclear conflict. Discussion: Our results demonstrate that shallow GS is an effective strategy for generating informative SNP datasets for phylogenetic inference in Solanum, despite limitations in recovering complete mitochondrial genomes and low-copy nuclear loci. SNP-based analyses substantially expand the phylogenetic potential of GS, providing a practical and cost-effective alternative for systematic studies.

2

Methodological pitfalls in plant pangenome gene family identification may lead to biased evolutionary inferences

Liu, S.; Zhang, W.; Yu, P.

2026-05-18 genomics 10.64898/2026.05.15.725319 medRxiv

Top 0.1%

15.0%

Pangenome-level gene family identification often applies sequence similarity clustering without phylogenetic or synteny information, which risks biologically misleading evolutionary inferences. Using five transcription factor families (bHLH, MYB, NAC, WRKY, MADS-box) across 401 rice pangenome accessions, we compared clustering strategies: OrthoFinder alone, cd-hit alone, MMseqs2 alone, and OrthoFinder-informed refinement by cd-hit or MMseqs2. Methods solely based on sequence similarity merged distinct orthogroups and generated fewer orthogroups than approaches incorporating graph-based orthology. Conflicting cluster assignments, measured against OrthoFinder, varied strongly among families, from approximately 14% in MADS-box to approximately 57% in MYB, and were associated with protein length differences. Core, shell, and cloud gene classifications shifted substantially depending on the method, especially in MYB, NAC, and WRKY families. Critically, Ka/Ks distributions for core genes were highly method-sensitive, with orthology-aware methods yielding more convergent and less variable estimates of selective pressure, whereas noncore gene estimates remained robust. These findings demonstrate that neglecting graph-based orthogroup inference inflates methodological artifacts. We recommend a two-step strategy: initial graph-based orthogroup delineation followed by sequence similarity refinement to balance evolutionary accuracy and resolution in pangenome-scale gene family studies.
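
The core, shell, and cloud classification mentioned above can be illustrated with a minimal sketch. The function name and the frequency cut-offs (100% of accessions for core, 5% or fewer for cloud) are assumptions chosen for illustration; the abstract does not state the thresholds the authors used.

```python
def classify_orthogroups(presence, n_accessions, core_min=1.0, cloud_max=0.05):
    """Label each orthogroup as core, shell, or cloud by accession frequency.

    presence maps an orthogroup ID to the set of accessions carrying it.
    core_min and cloud_max are illustrative cut-offs, not the preprint's.
    """
    labels = {}
    for orthogroup, accessions in presence.items():
        fraction = len(accessions) / n_accessions
        if fraction >= core_min:
            labels[orthogroup] = "core"
        elif fraction <= cloud_max:
            labels[orthogroup] = "cloud"
        else:
            labels[orthogroup] = "shell"
    return labels


# Toy pangenome with 20 hypothetical accessions.
accessions = [f"acc{i}" for i in range(20)]
presence = {
    "OG1": set(accessions),       # present in every accession
    "OG2": set(accessions[:10]),  # present in half of them
    "OG3": {"acc0"},              # present in a single accession
}
labels = classify_orthogroups(presence, n_accessions=len(accessions))
print(labels)
```

Because the labels depend only on which genes are grouped into each orthogroup, merging or splitting clusters upstream changes the class counts, which is consistent with the method sensitivity the abstract reports.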

3

SeedMeasure: an efficient approach and open-source program to quantify seed size

Sims, B.; Gaudinier, A.; Blackman, B.

2026-06-29 plant biology 10.64898/2026.06.27.734974 medRxiv

Top 0.1%

10.1%

Premise: Seed size and morphology are critical traits in agriculture, ecology, and genetics, but high-throughput quantification of these traits is often limited by labor-intensive manual measurements or expensive, platform-specific imaging software. Methods and Results: We developed SeedMeasure, a lightweight, open-source, and cross-platform command-line tool written in Python that automates the measurement of seed area, length, and width from images. Using a simple imaging setup, the program processes images by correcting for perspective skew and filtering debris, then exports quantitative data alongside quality-check images. We validated SeedMeasure across nine diverse species, ranging from small Arabidopsis thaliana seeds to large Zea mays kernels. The tool processes images quickly using multithreading and demonstrates high reproducibility, yielding low coefficients of variation across repeated runs. Conclusions: Compared to existing software, SeedMeasure is free, offers faster processing through parallel computing, and provides standalone executables that require no programming dependencies. SeedMeasure offers an accessible, cost-effective, and high-throughput approach for rapid phenotypic profiling, making advanced seed morphological analysis available to researchers without specialized laboratory hardware.
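
The reproducibility measure named in the abstract, the coefficient of variation across repeated runs, can be sketched as follows. This is a generic calculation, not SeedMeasure's own code; the function name and the example measurements are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean * 100.0


# Hypothetical seed-area measurements (mm^2) of one seed over three runs.
repeated_areas = [9.0, 10.0, 11.0]
cv_percent = coefficient_of_variation(repeated_areas)
print(f"CV = {cv_percent:.1f}%")
```

A low value indicates that repeated runs on the same seed return nearly the same measurement.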

4

EcoMorph: Universal morphological trait quantification from natural language prompts for ecological research

Amoah, E. I.; Bunch, Z.; Thomas, H. M.; Patch, H. M.; Grozinger, C.

2026-07-12 bioinformatics 10.64898/2026.07.10.737871 medRxiv

Top 0.1%

7.2%

Morphological traits such as floral area and body size are fundamental to ecological research, serving as inputs for studies of pollinator-plant interactions, habitat quality, and biodiversity monitoring. However, accurately measuring these traits from images remains challenging, particularly in complex field conditions where existing tools exhibit reduced accuracy and limited generalizability across taxa. We present EcoMorph, a modular morphological measurement system that leverages the Segment Anything Model 3 (SAM3) to quantify traits across diverse ecological contexts. Unlike task-specific segmentation models requiring domain-specific training data, SAM3's prompt-based architecture enables segmentation of arbitrary biological structures from natural-language prompts, using the same underlying model across flowers, insects, and other targets without retraining. From the resulting segmentations, EcoMorph extracts three classes of measurement: area, linear dimensions, and object counts. We validated EcoMorph across two ecological scales. At the intermediate scale, EcoMorph-derived floral area agreed closely with manual ImageJ measurements (R² = 0.935, n = 74) under simple-background conditions and (R² = 0.928, n = 58) under complex-background conditions, with valid predictions for 95% of images. At the fine scale, EcoMorph-derived insect body area was strongly correlated with hand-measured intertegular distance (r = 0.810, n = 349), capturing body-size variation across species from the small Bombus impatiens to the large Xylocopa virginica. Object counts matched manual counts almost exactly for well-separated insects in an insect box (R² = 0.9997, n = 12). By combining prompt-based segmentation with modular measurement, EcoMorph enables high-throughput quantification of area, size, and abundance from heterogeneous image sources without taxon-specific training. This generality supports a broad range of ecological applications, including pollinator and plant trait research, biodiversity and abundance monitoring, and allometric biomass estimation.
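
As a rough illustration of how area and linear dimensions can be derived once a segmentation mask exists, the sketch below measures a binary mask given a known image scale. It does not call SAM3 or reproduce EcoMorph's code: the mask is hand-written, the function name and `mm_per_pixel` parameter are hypothetical, and length and width are taken from an axis-aligned bounding box, which is a simplifying assumption.

```python
def measure_mask(mask, mm_per_pixel):
    """Measure area and bounding-box dimensions of a binary mask.

    mask is a list of rows of 0/1 values; mm_per_pixel is the image scale.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return {"area_mm2": 0.0, "length_mm": 0.0, "width_mm": 0.0}
    pixel_count = sum(1 for row in mask for value in row if value)
    height_px = max(rows) - min(rows) + 1
    width_px = max(cols) - min(cols) + 1
    return {
        "area_mm2": pixel_count * mm_per_pixel ** 2,
        "length_mm": max(height_px, width_px) * mm_per_pixel,
        "width_mm": min(height_px, width_px) * mm_per_pixel,
    }


# A 2 x 3 pixel object inside a 4 x 5 image, at 0.5 mm per pixel.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
traits = measure_mask(mask, mm_per_pixel=0.5)
print(traits)
```

Object counts would additionally require separating connected regions in the mask, which this sketch leaves out.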

5

Resolving the oak tree of life: comparing RADseq and whole genome resequencing methods for oak phylogenetics

Hipp, A. L.; Althaus, K. N.; Fuller, E. L.; Hahn, M.; Larson, D. A.; Mohn, R. A.; Wang, B.; Manos, P. S.

2026-05-17 evolutionary biology 10.64898/2026.05.14.725274 medRxiv

Top 0.1%

6.7%

Forest trees pose numerous potential challenges to phylogenomic inference. Their large effective population sizes and relatively long generation times lead to deep allele coalescence and consequently incomplete lineage sorting (ILS), which biases inferences of divergence times toward older ages and introduces gene tree discordance. Deep phylogenetic divergences, reaching back into the Paleocene, introduce reference-mapping biases. Introgression--the movement of genes between lineages--may result in different phylogenies being inferred depending on which individuals are included in analysis, even if the plurality of the genome favors the divergence history unaffected by introgression. These factors influence phylogenetic inference across the Tree of Life but are particularly prevalent in forest trees. Oaks (Quercus) are notable for all three influences. In addition, our knowledge of the oak phylogeny is currently based strongly on restriction site associated DNA sequencing (RADseq) datasets published over the past decade, which may introduce additional sources of uncertainty. In this chapter, we analyze a 322-species RADseq dataset and genome resequencing data from across the genus to address sources of uncertainty in our understanding of the global oak phylogeny, which we hope will serve as a model for other research groups working on comparable woody plant groups.

6

A Practical Roadmap For Sampling Floral Nectar From Communities of Many Plant Species

Kirschke, G. E.; Bain, J. A.; Ogilvie, J. E.; CaraDonna, P. J.

2026-06-23 ecology 10.64898/2025.12.19.695174 medRxiv

Top 0.1%

6.7%

Floral nectar plays a critical role in shaping the ecology and evolution of plant-pollinator interactions. Effective and efficient methods that allow for broad-scale sampling of nectar volume and sugar concentration across a diversity of taxa are needed to improve our understanding of many dimensions of mutualistic plant-pollinator interactions--including their basic ecology and evolution, their responses to environmental change, and their conservation and restoration. Despite the key importance of nectar for mediating plant-pollinator interactions, quantifying floral nectar in the field from many different plant species is challenging because there is often no one-size-fits-all sampling method that is effective across a diversity of floral structures and nectar traits. Different methods require different preparation, and sampling from many species involves a variety of logistical challenges. Here we provide a methodological roadmap for sampling floral nectar in the field from many different plant species. We describe our nectar collection methods in detail, including necessary equipment, calculations, and approaches appropriate for different floral morphologies. We also provide a troubleshooting guide for common problems encountered while collecting nectar in the field. To demonstrate the utility and effectiveness of our methods for collecting nectar from many different species, we present results on nectar trait variation from 53 species in an ecosystem. Our method illustrates that nectar traits vary considerably within and among plant species, indicating that large-scale nectar sampling projects are an important consideration for many basic and applied questions in pollination ecology and evolution. We hope that across many plant communities and ecosystems, our paper provides a practical roadmap for how to navigate the complexities of quantifying floral nectar traits.

7

UVfinder: a tool to extract bryophyte sex-linked gene copies from the GoFlag408 probe set

Kim, S.; Bowman, J.; Braun, E. L.; McDaniel, S.

2026-07-07 bioinformatics 10.64898/2026.07.01.735932 medRxiv

Top 0.1%

6.6%

Target enrichment sequencing using probe sets like GoFlag 408 has revolutionized phylogenetics, yet recent genomic data indicate that some probes may be sex-linked, potentially introducing topological conflict while also allowing studies of sex-specific evolutionary processes. To test for sex-linkage across the bryophytes, we developed UVfinder, a pipeline designed to identify sex-linked GoFlag loci across published moss genomes and enable sex-aware downstream analyses. Applying UVfinder to 50 dioicous moss genomes, we identified 93 probes that exhibit sex-linkage in one or more lineages, providing genomic evidence for neo-sex chromosome formation via autosome-sex chromosome fusion and gene translocation. Furthermore, by comparing species trees derived from sex-linked versus autosomal loci in Hypnales and Dicranidae, we demonstrate that sex-linked loci harbor phylogenetic information that is distinct from that in autosomes. We also discovered a pervasive female sampling bias in the genomic data, perhaps reflecting a preference among collectors for plants with sporophytes. Ultimately, our findings highlight the dynamism in sex linkage across bryophytes and suggest that sex-aware phylogenomics can be used to reconstruct ancestral karyotypes and potentially resolve topological conflict. We expect that UVfinder will facilitate the further study of sex-specific evolutionary processes, particularly with improved genome assemblies and increased sampling in males.

8

Small representative samples can capture global vascular plant diversity patterns

Baldaszti, L.; Moonlight, P.; Brummitt, N.; Pironon, S.; Sarkinen, T.

2026-07-10 plant biology 10.64898/2026.07.08.737287 medRxiv

Top 0.1%

6.2%

Incomplete information on distributions for a high proportion of the world's plant species together with biases in global biodiversity data mean that current estimates of plant diversity patterns are skewed. A key issue is that current predictions rely on a subset of species that is not representative of all plant species. Here we tested the feasibility of a representative sampling approach for mapping global vascular plant diversity at the finest scale where comprehensive data is available. Using the World Checklist of Vascular Plants as a reference, we generate random samples of species with increasing sample sizes from the global species pool. We compare the diversity patterns retrieved from the samples against the patterns of the reference dataset using spatially weighted correlation coefficients and four different diversity metrics. We find that, at the botanical country scale, representative global maps of species and phylogenetic diversity can be created with small numbers of species (~1% [0.2% and 0.4%, respectively]). For effective growth form and family diversity, sample sizes encompassing ~20% [19.2% and 19.5%, respectively] of all species are needed. Random samples require markedly fewer species to reach high correlations than when restricting the pool of species to single plant families or genera. We show that, when representative samples are used, robust inferences of plant diversity patterns can be made from only a small proportion of species.
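
The sampling test described above can be sketched in a few lines: draw a random subset of species, compute per-region richness from the subset, and correlate it with richness from the full pool. This sketch makes several assumptions: it uses a plain Pearson correlation rather than the spatially weighted coefficients the authors report, it covers species richness only, and the species ranges and region codes are invented toy data.

```python
import math
import random


def richness(species_ranges, species, regions):
    """Count, for each region, how many of the given species occur there."""
    return [sum(1 for s in species if region in species_ranges[s])
            for region in regions]


def pearson(x, y):
    """Plain (unweighted) Pearson correlation; NaN if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def sample_correlation(species_ranges, regions, sample_size, seed=0):
    """Correlate richness from a random species sample with full-pool richness."""
    rng = random.Random(seed)
    all_species = sorted(species_ranges)
    sampled = rng.sample(all_species, sample_size)
    return pearson(richness(species_ranges, all_species, regions),
                   richness(species_ranges, sampled, regions))


# Toy checklist: species mapped to the regions where they occur.
species_ranges = {
    "sp1": {"A", "B"}, "sp2": {"A"}, "sp3": {"A", "C"},
    "sp4": {"B"}, "sp5": {"A", "B", "C"},
}
regions = ["A", "B", "C"]
full_r = sample_correlation(species_ranges, regions, sample_size=5)
print(full_r)
```

Repeating the call over many seeds and increasing `sample_size` values would trace how quickly the correlation approaches 1, which is the quantity the abstract summarises.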

9

Diversity Assessment with SNP, SSR, AFLP, and RAPD Markers in Plants: A Systematic Review and Meta-Analysis

Olagunju, Y. O.; Olawuyi, O. J.

2026-07-07 plant biology 10.64898/2026.07.03.736291 medRxiv

Top 0.1%

6.2%

Background. DNA-based molecular markers underpin plant genetic diversity assessment, germplasm characterisation, and conservation prioritisation. Four marker systems dominate the field: amplified fragment length polymorphisms (AFLPs), simple sequence repeats (SSRs), single nucleotide polymorphisms (SNPs), and random amplified polymorphic DNA (RAPDs). No quantitative meta-analysis had pooled their performance on the canonical diversity metrics: polymorphism information content (PIC), expected heterozygosity (He), and resolution power, across plants. Existing reviews are narrative, marker-restricted, or qualitatively conclusive of infeasibility. Methods. A PRISMA 2020-compliant systematic review (registered at the Open Science Framework) was executed. Eligible studies were within-study paired comparisons genotyping the same accession panel with at least two of {SNP, SSR, AFLP, RAPD} and reporting at least one diversity metric. Effect sizes were paired standardised mean differences (Hedges' g) computed under the Bernoulli-variance approximation. Random-effects REML meta-analysis used metafor 5.0.1 with Knapp-Hartung adjustment, leave-one-out, and r-sensitivity. Results. Fifteen within-study paired contrasts were eligible, distributed across three pools. Pool 2 (SSR vs SNP, He, k = 5) yielded a pooled Hedges' g of 0.494 (95% CI: -0.078 to 1.066, p = 0.075; I-squared = 90.2%; 95% PI [-0.82, 1.81]). SSRs exceeded SNPs on He in 4 of 5 studies; leave-one-out removal of the panel-size-asymmetric outlier raised the estimate to g = 0.644 (p = 0.025). Pool 3a (dominant-marker stratum, k = 6) yielded g = 0.419 (95% CI: -0.121 to 0.960, p = 0.103; I-squared = 56.5%); five of six contrasts showed SSR or AFLP exceeding RAPD on per-locus PIC. Pool 1 (PIC, k = 3, exploratory) gave a consistent direction (g = 0.453). All three pools point in the same direction: codominant or AFLP markers carry more per-locus information than the alternative being compared. Conclusions. SSR markers reported higher per-locus diversity than SNP and RAPD markers in plant within-study paired comparisons, mechanistically grounded in the SNP biallelic ceiling and the multi-allelic richness of SSRs. The effect attenuated or reversed in selfing/low-diversity panels and at the per-panel level when SNP panels exceeded approximately 1000 loci. RAPDs show the lowest per-locus information content of the four classes.
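
The "SNP biallelic ceiling" can be made concrete with the standard per-locus formulas for codominant markers: expected heterozygosity He = 1 - Σ p_i², and PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j². The sketch below is a generic calculation under those formulas, not the review's analysis code; the allele frequencies are illustrative, and dominant markers (AFLP, RAPD) use different PIC definitions that are not covered here.

```python
from itertools import combinations


def expected_heterozygosity(freqs):
    """He = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC for a codominant locus: He minus the pairwise correction term."""
    correction = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return expected_heterozygosity(freqs) - correction


# A biallelic SNP at its most informative (both alleles at 0.5).
snp_he, snp_pic = expected_heterozygosity([0.5, 0.5]), pic([0.5, 0.5])

# A hypothetical SSR locus with four equally frequent alleles.
ssr_freqs = [0.25, 0.25, 0.25, 0.25]
ssr_he, ssr_pic = expected_heterozygosity(ssr_freqs), pic(ssr_freqs)

print(snp_he, snp_pic)
print(ssr_he, ssr_pic)
```

With two alleles, He cannot exceed 0.5 and PIC cannot exceed 0.375, whereas a multi-allelic locus can go higher, which is the mechanism the conclusions invoke.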

10

MycorrhizaTracer: a bioinformatic pipeline for fungi and plant classification of Sanger DNA sequences

Brekke, T. D.; Weeks, T.; Barber, R. A.; Thomson, I.; Gooda, R.; Gargiulo, R.; Delhaye, G.; Andrew, C.; Kowal, J.; Bidartondo, M.; Martinez-Suz, L.

2026-04-27 bioinformatics 10.64898/2026.04.23.720352 medRxiv

Top 0.1%

6.2%

Processing Sanger DNA sequences remains a routine yet technically demanding step in many biodiversity and ecological studies, particularly when barcoding large numbers of environmental samples. Manual inspection and editing of trace files, DNA sequence alignment, and classification using taxonomic reference databases are time-consuming, inconsistent, and prone to error. These challenges are compounded in studies involving degraded samples, in-house DNA sequencing, under-described taxa, or when investigators have limited access to computational tools. We present MycorrhizaTracer, an open-source, fully automated pipeline for processing and taxonomically classifying large batches of Sanger sequencing chromatograms. We have optimized it for fungal and plant taxa, but it is adaptable across the tree of life. The pipeline performs quality trimming, consensus generation from bidirectional reads, taxonomic classification via BLAST, clustering, optional salvaging of low-quality sequences, and functional annotation of fungal taxa. Designed for scalability and ease of use, MycorrhizaTracer can process thousands of DNA chromatograms in a matter of hours without the need for an HPC. Accuracy and ecological relevance are ensured by features such as gene region-specific taxonomic filtering and sequence-based clustering of unclassified reads. By streamlining trace-to-taxon workflows, MycorrhizaTracer reduces the burden of manual curation, supports reproducibility, and enables efficient recovery of biodiversity data from Sanger sequences - particularly in field-based or resource-limited research contexts.

11

Integrating lineage-specific and universal genomic probes illuminates phylogenetic relationships and molecular evolution in Sauvagesieae (Ochnaceae)

Reinales, S.; Forest, F.; Zuntini, A.; Cardoso, D.; Ballen, G. A.; Cardenas, D.; Pirani, J. R.

2026-05-01 genomics 10.64898/2026.04.29.721621 medRxiv

Top 0.1%

6.1%

Obtaining large and well-resolved phylogenetic trees for neotropical clades is challenging, as many species inhabit remote regions, and sampling often relies on herbarium specimens with highly degraded DNA. Target capture provides an effective solution for retrieving molecular data from fragmentary material. However, data processing using tools generally designed for diploid organisms and single-copy loci is also challenging, particularly when events such as genome duplication and hybridisation have shaped the lineage evolution. We used dual-hybridisation to integrate Ochnaceae-specific and universal probes to reconstruct the phylogenetic relationships of Sauvagesieae, a pantropical clade with ca. 90 species mainly distributed in the northern Andes, the Brazilian Espinhaco Range, and the Amazon-Guyana region. We tested different filtering strategies involving missing data and paralogs to assess probable sources of tree discordance and topological uncertainty. We found no significant benefit in reducing tree discordance after removing entire genes due to the presence of paralogs or a high amount of missing data. Removing fragmentary sequences instead improved alignments and increased branch support of gene trees. By quantifying the proportion of SNPs, analysing the distribution of the allele frequencies, and gene-tree quartet frequencies, we found evidence of polyploidisation and hybridisation, which could reduce resolution at internal nodes, particularly in mountain clades. Our results underscore the importance of exploring the complexities of target-capture data, not only to improve phylogenetic resolution but also to understand the sources of phylogenetic conflict and the underlying molecular evolutionary processes.

12

A Bayesian approach for identifying similar transcript dynamics using curve registration

Kristianingsih, R.; Calderwood, A.; Sidhu, G.; Woodhouse, S.; Woolfenden, H. C.; Kurup, S.; Wells, R.; Morris, R. J.

2026-04-29 bioinformatics 10.64898/2026.04.26.720911 medRxiv

Top 0.1%

5.5%

Changes in gene expression over time can provide valuable insights into developmental processes and responses to the environment. Differences in expression may be indicative of potential differences in regulation. Comparing transcript dynamics may help identify correspondences between developmental stages within and between species, differences in the timing of key events during development, and transcriptional response to treatments or perturbations. A straightforward comparison between the dynamics is, however, hindered by measurements that were taken at different time points and over different timescales. To address this, we developed a statistical approach that seeks the optimal alignment between two time series as a function of a temporal shift and stretch. We validated our approach using simulated data and applied it to several transcriptome datasets, including comparisons between different plant species. Our development facilitates knowledge transfer from model systems to less studied species, the identification of modules of co-regulated genes, and the discovery of condition-specific, temporally differentially-expressed genes. The method is freely available as an R package.
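
The idea of aligning two time series through a temporal shift and stretch can be sketched with a simple least-squares grid search. This is a swapped-in, non-Bayesian illustration written in Python, not the authors' method or their R package's interface; the function names, the candidate grids, and the minimum-overlap rule are all assumptions.

```python
def interpolate(times, values, t):
    """Linearly interpolate a series at time t; None outside its range."""
    if t < times[0] or t > times[-1]:
        return None
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return None


def register(ref_t, ref_y, qry_t, qry_y, stretches, shifts, min_overlap=3):
    """Find the (stretch, shift) mapping query time onto reference time
    that minimises mean squared error over the overlapping points."""
    best = None
    for stretch in stretches:
        for shift in shifts:
            errors = []
            for t, y in zip(qry_t, qry_y):
                ref_value = interpolate(ref_t, ref_y, stretch * t + shift)
                if ref_value is not None:
                    errors.append((y - ref_value) ** 2)
            if len(errors) < min_overlap:
                continue  # too little overlap to judge this alignment
            mse = sum(errors) / len(errors)
            if best is None or mse < best[0]:
                best = (mse, stretch, shift)
    return best


# Reference expression sampled daily; the query runs twice as fast.
ref_t = list(range(11))
ref_y = [float(t * t) for t in ref_t]
qry_t = [0, 1, 2, 3, 4, 5]
qry_y = [float((2 * t) ** 2) for t in qry_t]

best_mse, best_stretch, best_shift = register(
    ref_t, ref_y, qry_t, qry_y,
    stretches=[0.5, 1.0, 1.5, 2.0, 2.5],
    shifts=[-2.0, -1.0, 0.0, 1.0, 2.0],
)
print(best_mse, best_stretch, best_shift)
```

A Bayesian treatment would replace the grid search with a posterior over shift and stretch and would compare the registered model against a model in which the two series are unrelated.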

13

LIME: a fully automated pipeline for high-throughput quantification of leaf lesions

Tan, D.

2026-05-10 plant biology 10.64898/2026.05.07.723432 medRxiv

Top 0.1%

5.3%

Accurate quantification of leaf lesion severity is essential for plant disease research and phenotyping but is often limited by subjective visual scoring and time-intensive manual image analysis. We present LIME, a fully automated, open-source image analysis pipeline for high-throughput quantification of leaf lesions from disease assay images. LIME integrates zero-shot leaf segmentation using the Segment Anything Model with a convolutional neural network for lesion area estimation. Applied to Arabidopsis thaliana leaves infected with Sclerotinia sclerotiorum, the proposed approach achieved a mean absolute percentage error of 12.9%, comparable to observed intrarater variability in manual scoring. Stratified evaluation across lesion-size groups demonstrated consistent prediction accuracy for small, intermediate, and large lesions, and comparative analysis showed that the deep learning-based model substantially outperformed color-based baseline methods. Under GPU-accelerated execution, LIME processed complete assays containing approximately 200 leaves in 15 minutes, representing an approximate 13-fold reduction in processing time relative to manual annotation. Together, these results indicate that LIME enables objective, reproducible, and scalable quantification of leaf lesion severity in standardized plant pathology assays. The pipeline is released as an open-source tool to support quantitative phenotyping studies.
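
The accuracy figure quoted above is a mean absolute percentage error (MAPE). A minimal sketch of that metric follows; the function name and the lesion areas are hypothetical, and this is not code from the LIME pipeline.

```python
def mean_absolute_percentage_error(true_values, predicted_values):
    """MAPE in percent; true values must be non-zero."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for truth, prediction in zip(true_values, predicted_values):
        if truth == 0:
            raise ValueError("MAPE is undefined when a true value is zero")
        total += abs(prediction - truth) / abs(truth)
    return total / len(true_values) * 100.0


# Hypothetical manually annotated vs. predicted lesion areas (mm^2).
manual = [10.0, 20.0]
predicted = [11.0, 18.0]
mape = mean_absolute_percentage_error(manual, predicted)
print(f"MAPE = {mape:.1f}%")
```

Because each error is scaled by the true area, small lesions weigh as much as large ones, which is one reason a stratified evaluation by lesion size is informative.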

14

Near-Gapless and Haplotype-Resolved Capsella Genomes Enable Investigation into Genomic Consequences of Mating System Shifts

Chen, H.; Emmerson, R.; Mosher, R. A.

2026-07-10 plant biology 10.64898/2026.07.10.737683 medRxiv

Top 0.1%

5.3%

The shift from outcrossing to self-fertilization is a common evolutionary transition in flowering plants. The genus Capsella, comprising the obligate outcrosser C. grandiflora and two self-fertile species, C. rubella and C. orientalis, provides a powerful system to explore genomic consequences of mating system shifts. Despite its utility, existing genomic resources in Capsella are fragmented, incomplete, and particularly deficient in repetitive genomic regions, hindering the study of transposable element (TE) dynamics and gene annotation. Here, we present high-quality, chromosome-scale, near-gapless genome assemblies for C. grandiflora, C. rubella, and C. orientalis. Leveraging these improved genomes, we created high-quality genomic resources for the Capsella genus by performing comprehensive, de novo annotations of protein-coding genes and TEs. Comparative genomic analysis among these species reveals differences in TE abundance, position, and production of small RNAs. These resources provide an unprecedented opportunity to explore how mating system transitions influence genome architecture, TE behavior, and gene evolution. This research also developed a static online platform for Capsella genomic resources, Capsella Database (CapBase, www.capsella.uk), to support community use of these resources. Our findings advance understanding of the genomic impacts of selfing and establish a robust foundation for future research into genomics, epigenomics, and evolutionary biology within Capsella and related plant systems.

15

PhytoScan3D: an open-source Python pipeline for batch extraction of phenotypic traits from 3D point cloud files generated by multispectral plant phenotyping sensors

Kovi, M. R.; Leite, A. C.; Lillemo, M.

2026-06-04 plant biology 10.64898/2026.06.01.729298 medRxiv

Top 0.1%

4.8%

High-throughput 3D multispectral plant phenotyping platforms generate large volumes of point cloud files, but trait extraction is typically performed by sensor-bundled software whose internal algorithms are not publicly documented, which limits reproducibility and integration into custom research pipelines. Here we present PhytoScan3D, an open-source Python pipeline that extracts morphological and spectral phenotypic traits, spanning plant height, 3D leaf area, digital biomass, convex hull volume, leaf inclination, canopy geometry, NDVI, hue, and vegetation indices, from both PLY and PCD point cloud files generated by Phenospex PlantEye F500 and F600 sensors, and is portable to point clouds from any acquisition platform. PhytoScan3D was validated against HortControl (PhenoSpex) ground-truth measurements on 936 barley (Hordeum vulgare) pot-date observations from the growth chamber trial (20 Norwegian cultivars, 12 scan dates, September 2025 to January 2026), achieving Pearson r = 0.913 to 0.999 and ratio approximately 1.000 for Plant Height Max, 3D Leaf Area, and NDVI Average. A vectorised mesh face filtering implementation achieved a 120x speed improvement, increasing valid 3D Leaf Area coverage from 0.6% to 100% of files. Cross-format validation on 223 PlantEye F600 PCD files from the ICRISAT LeasyScan platform (four legume species: mungbean, cowpea, lima bean, and common bean; 1,523 plant observations) yielded r = 0.884 against independent cuboid annotation heights. The systematic positive bias (mean +27.2 mm, ratio = 1.44) is attributable to PhytoScan3D computing height from raw point cloud Z-range while cuboid annotations are fitted to segmented plant points only, with the offset consistent across all four species (per-species r = 0.880 to 0.888). Cross-dataset processing of 1,180 PLY files from the Crops3D benchmark (8 species, 3 acquisition methods) confirmed zero extraction errors. PhytoScan3D is available at "github.com/kovimallik/phytoscan3d" under the MIT licence and processes 1,651 files across three independent datasets in under 12 minutes on GPU hardware.

Highlights:
- PhytoScan3D is the first open-source Python pipeline for batch extraction of phenotypic traits, including plant height, 3D leaf area, digital biomass, convex hull volume, leaf inclination, NDVI, and excess green index, from both PLY and PCD point cloud files generated by Phenospex PlantEye sensors.
- Primary validation against HortControl ground-truth measurements on 936 barley pot-date observations achieved Pearson r = 0.913-0.999 for Plant Height Max, 3D Leaf Area, and NDVI Average.
- A 120x computational speedup in mesh face filtering (vectorised NumPy vs. set-based loop) increased the coverage of valid 3D Leaf Area extraction from 0.6% to 100% of files.
- Cross-format validation on 223 PlantEye F600 PCD files from ICRISAT LeasyScan (four legume species, 1,523 plants) achieved r = 0.884 against independent cuboid annotation heights. The systematic +27.2 mm bias reflects a methodological difference (raw Z-range vs. soil-segmented annotations), is consistent and predictable across all four species (per-species r = 0.880-0.888), and is correctable by a single linear factor.
- Cross-dataset processing of 1,180 PLY files from the Crops3D benchmark (8 species, 3 acquisition methods) confirmed zero extraction errors.
- Significant scan-unit variation was detected for Plant Height Max (F = 5.71, p < 0.001, η² = 0.138) and Canopy Width X (F = 6.32, p < 0.001, η² = 0.150), demonstrating the biological utility of extracted traits.
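
Two of the traits named above can be sketched directly from their definitions: height as the Z-range of the raw point cloud, and NDVI as (NIR - Red) / (NIR + Red) averaged over points. The sketch below is not PhytoScan3D's code; the point layout `(x, y, z, red, nir)`, the function names, and the sample values are assumptions made for illustration.

```python
def plant_height(points):
    """Height as the Z-range of all points (no soil or pot segmentation)."""
    z_values = [point[2] for point in points]
    return max(z_values) - min(z_values)


def ndvi(nir, red):
    """Normalised difference vegetation index for one point."""
    return (nir - red) / (nir + red)


def mean_ndvi(points):
    """Average NDVI over points, skipping those with zero total reflectance."""
    values = [ndvi(p[4], p[3]) for p in points if (p[4] + p[3]) != 0]
    return sum(values) / len(values) if values else float("nan")


# Hypothetical points: (x, y, z in mm, red reflectance, NIR reflectance).
points = [
    (0.0, 0.0, 0.0, 0.2, 0.8),
    (1.0, 0.0, 50.0, 0.2, 0.6),
    (0.0, 1.0, 120.0, 0.1, 0.3),
]
height_mm = plant_height(points)
average_ndvi = mean_ndvi(points)
print(height_mm, round(average_ndvi, 3))
```

Because the Z-range uses every point, any non-plant points below the canopy raise the estimate, which matches the abstract's explanation for the positive bias against annotations fitted to segmented plant points.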

16

An evaluation of clustering and assembly strategies from Iso-Seq data in the absence of reference genomes in non-model animals

Eleftheriadi, K.; Vazquez-Valls, M.; Fernandez, R.

2026-07-08 evolutionary biology 10.1101/2025.09.18.677004 medRxiv

Top 0.1%

4.7%

Transcriptome assembly enables the recovery of expressed genes and isoforms, but the optimal strategy for reconstructing transcriptomes from long-read sequencing remains unresolved. In particular, establishing best practices for generating accurate gene models and selecting representative isoforms is essential for comparative genomics, as for orthology inference typically only the longest isoform per gene model is included. Here, we systematically compare clustering and de novo assembly methods using PacBio Iso-Seq data from 17 animal lineages spanning seven phyla, most of them non-model species, to determine which methodology is best suited to selecting one isoform per gene model, given that no dedicated pipelines exist for this task. We evaluate four approaches: isoseq cluster, CD-HIT, RNA-Bloom2 and isONform. We benchmark them against short-read assemblies generated with Trinity, assessing assembly quality with BUSCO completeness, short-read mapping rates, coding sequence recovery, and longest isoform prediction. Our results show that CD-HIT clustering at high similarity thresholds (≥99%) yields the most complete and coding-rich long-read transcriptomes, rivaling Trinity while avoiding its high redundancy. Consensus-based methods such as isoseq cluster and isONform recover fewer single-copy orthologs (mirrored in a lower BUSCO score) and achieve lower mapping rates, while RNA-Bloom2 provides intermediate performance with reduced duplication. Together, these findings establish CD-HIT as, to date, a robust and practical strategy for transcriptome reconstruction from long-read data when genomic references are unavailable. By benchmarking de novo methods across a taxonomically broad dataset, this work defines the realistic capabilities of long-read transcriptome reconstruction in the absence of a reference genome and provides practical guidance for deriving high-quality gene models and selecting representative isoforms for orthology inference in non-model species.
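
The final selection step the abstract refers to, keeping only the longest isoform per gene model before orthology inference, can be sketched as below. The input structure (gene ID mapped to isoform sequences) and all identifiers are hypothetical; in practice the grouping of isoforms into genes is exactly what the compared clustering and assembly tools have to supply.

```python
def longest_isoforms(isoforms_by_gene):
    """Return, for each gene, the (isoform ID, sequence) with the longest sequence.

    Ties are resolved in favour of the isoform listed first.
    """
    selected = {}
    for gene, isoforms in isoforms_by_gene.items():
        if not isoforms:
            continue  # skip genes without any isoform
        selected[gene] = max(isoforms.items(), key=lambda item: len(item[1]))
    return selected


# Toy input: two genes with invented isoform sequences.
isoforms_by_gene = {
    "geneA": {"geneA.1": "ATGAAATAG", "geneA.2": "ATGAAACCCGGGTAG"},
    "geneB": {"geneB.1": "ATGTTTTAG"},
}
representatives = longest_isoforms(isoforms_by_gene)
print(representatives)
```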

17

Near chromosome-level genome assembly for the invasive annual forb Centaurea melitensis

Dant, A.; Pelosi, J.; Northing, P. C.; Dlugosch, K. M.

2026-05-20 genomics 10.64898/2026.05.18.726060 medRxiv

Top 0.1%

4.4%

Premise: Centaurea melitensis (Asteraceae) is a problematic invader of grasslands globally, but little is known about its genetic makeup. Here we develop a reference genome to facilitate studies of its invasion history, genetic variation, and evolution. Methods: Inbred offspring of a single individual of C. melitensis from its invasion of California, USA, were used for flow cytometry to estimate genome size and for genomic DNA extraction. DNA was sequenced with PacBio HiFi technology (yield = 85.7 Gb). The genome was assembled with Hifiasm and annotated with BRAKER3. GENESPACE was used to compare gene order (synteny) with three other species within the subfamily Cichorioideae. Results: We estimated a mean genome size of 795.0 Mbp for C. melitensis, and our assembly totaled 696.6 Mbp in 48 contigs (N50 = 55.6 Mbp; BUSCO = 98%), with annotation of 25,157 protein-encoding genes. This included four telomere-to-telomere putative chromosomes, nine additional chromosome arms terminated by telomeric repeats, and a complete chloroplast genome. Synteny varied markedly across the genus and subfamily, suggesting a dynamic history of structural variation in the lineage of C. melitensis. Discussion: We provide a highly complete and contiguous genome assembly to facilitate the further study of genomic variation in C. melitensis.
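
For readers unfamiliar with the contiguity statistic quoted in this abstract, N50 is the contig length at which contigs of that length or longer account for at least half of the total assembly length. The Python sketch below shows the calculation under that standard definition; the function name `n50` and the toy contig lengths are assumptions for illustration and are not taken from the preprint.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    length of all contigs.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # integer comparison avoids float rounding
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths (invented): total 30, so half is 15; 10 + 8 = 18 >= 15.
print(n50([10, 8, 6, 4, 2]))  # 8
```

A high N50 relative to total assembly size, as reported here, indicates that most of the assembled sequence sits in a few long contigs.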

18

Elab2ARC: A Browser-Based Workspace for Converting Free-Text Protocols into rich FAIR digital objects

Zander, S.; Zhou, X.-R.; Kranz, A.; Dumschott, K.; Rocca-Serra, P.; Weil, H. L.; Tschoepke, M.; Muehlhaus, T.; Von Suchodoletz, D.; Usadel, B.

2026-05-18 bioinformatics 10.64898/2026.05.14.724833 medRxiv

Top 0.1%

3.4%

Electronic laboratory notebooks (ELNs) are widely used in the life sciences, but their notebook format limits machine-readability and FAIR compliance. Consequently, researchers often spend significant manual effort restructuring ELN records into publication-ready outputs. We present elab2ARC, a browser-based workspace that automates the conversion of open-source eLabFTW records into Annotated Research Contexts (ARCs), which are version-controlled, ISA-compliant research objects. Using the eLabFTW API, elab2ARC retrieves administrative metadata, protocols, and attachments, reorganising them into ISA-compliant tables and linked datasets. All processing occurs client-side, ensuring user data control before submission to the PLANTdataHUB repository. An optional LLM-assisted workflow extracts structured metadata from free-text protocols, providing editable drafts while preserving human oversight. Designed for use at project completion, elab2ARC reuses existing ELN documentation without disrupting daily laboratory practice. It offers a practical route to FAIR-aligned sharing, publication, and long-term archiving of life-science experimental records. Availability and implementation: elab2ARC is freely accessible at https://nfdi4plants.org/elab2arc/. The source code is available at https://github.com/nfdi4plants/elab2arc under a GPL-3.0 license. Supplementary information: Supplementary data are available online.

19

ZonationR: An R interface to the Zonation software for reproducible spatial conservation prioritisation workflows

Cavalcante, T.; Ribeiro, B.; Guidoni-Martins, K.; Kujala, H.

2026-04-30 bioinformatics 10.64898/2026.04.28.720523 medRxiv

Top 0.1%

3.3%

Systematic conservation planning provides a science-based framework for defining conservation goals and supporting transparent spatial decision-making under limited resources. Within this framework, spatial conservation prioritisation tools are widely used to identify areas of high biodiversity value by integrating information on species distributions, connectivity, costs, and other factors into spatially explicit recommendations. Zonation is one of the leading software tools in this field, producing hierarchical priority rankings of landscapes based on conservation value. However, its standard workflow typically relies on manual steps for data preparation, execution, and post-processing, which can become inefficient and difficult to reproduce when multiple scenarios are analysed, limiting accessibility and broader uptake. We introduce ZonationR, an R package that provides a streamlined interface to the Zonation software, enabling fully reproducible and automated spatial prioritisation workflows. The package integrates the entire analysis pipeline, encompassing input preparation, execution of Zonation, and post-processing, while supporting both single-variant and multi-variant workflows. ZonationR also provides tools to explore and interpret outputs, including priority maps, feature performance curves, cost summaries, feature representation metrics, and similarity assessments between prioritisation solutions. By linking directly to the original Zonation engine, the package enables users to benefit from ongoing methodological developments in Zonation and access its functionality through a transparent, script-based workflow, thereby reducing technical barriers to running and understanding spatial prioritisation analyses. 
Beyond these advantages, its integration within the R environment supports iterative testing of conservation scenarios and more rigorous assessment of methodological decisions, while facilitating seamless connections with wider ecological workflows (e.g., species distribution modelling). As conservation planning increasingly relies on large, complex, multi-source datasets and integrative approaches, such integration is essential for enabling robust, transparent, and reproducible decision-making across spatial scales.

20

trAIt: Species-by-Trait Data Retrieval using Large Language Models

Balaji, S.; Martinson, K. A.; Schellenberger, J. S.; Koley, J.; Inman, C. M.; Hofmann, H. A.; Young, R. L.; Harpak, A.

2026-06-24 bioinformatics 10.64898/2026.06.19.732660 medRxiv

Top 0.1%

3.3%

Biological research often requires information about species traits. Manual literature collation can be time-consuming and can miss parts of the literature. To address this gap, we developed trAIt, publicly available software for retrieving species characteristics from scientific literature catalogued in the Europe PubMed Central (PubMed) database. trAIt provides a graphical user interface (GUI) in which users specify species and characteristics of interest. Leveraging a large language model (LLM), trAIt retrieves relevant papers, combines their content through a consensus-based summarization model, and outputs a species-by-characteristic table. For a case study involving frog species, trAIt recovered 47.1% of trait-species combinations in 2.75 hours, while an expert curator independently recovered 62.4% over months. The consensus-based summarization substantially improves accuracy compared to single-source extraction. Across three case studies of vertebrate taxa, an expert confirmed the accuracy of 70.9% of trait-species entries recovered by trAIt. We observed considerable variation across taxa in trAIt's accuracy, possibly due to heterogeneity in open-access literature availability and inconsistencies in species and trait terminology. In sum, our analysis suggests that LLM-based tools can accelerate biological data synthesis but should be used to support domain experts' research rather than replace their judgment.